feat: Celery worker concurrency setting #1010
@regisb, mind taking a look?
@@ -141,7 +141,7 @@ spec:
 containers:
 - name: cms-worker
 image: {{ DOCKER_IMAGE_OPENEDX }}
-args: ["celery", "--app=cms.celery", "worker", "--loglevel=info", "--hostname=edx.cms.core.default.%%h", "--max-tasks-per-child", "100", "--exclude-queues=edx.lms.core.default"]
+args: ["celery", "--app=cms.celery", "worker", "--loglevel=info", "--hostname=edx.cms.core.default.%%h", "--concurrency={{ OPENEDX_CMS_CELERY_WORKERS }}", "--max-tasks-per-child", "100", "--exclude-queues=edx.lms.core.default"]
The --concurrency argument specifies the number of worker processes. This is not an ideal practice in Kubernetes environments because:
- Kubernetes prefers to manage scalability and replication at the container orchestration level, using the replicas field in a Deployment to manage the number of pod instances.
- Setting --concurrency inside a container limits scalability to the process level inside the pod, rather than allowing Kubernetes to manage multiple pods across nodes for better fault tolerance and load distribution.
- It violates the "one process per container" principle. This matters because when multiple processes share a container, their logs are mixed together, which makes the container harder to troubleshoot, and the process lifecycle is harder to manage.
So it's better to just hardcode --concurrency=1.
That all makes sense, but it might be counter-intuitive to have a configuration item that works for one deployment scenario but not another.
I mean, we have OPENEDX_CMS_UWSGI_WORKERS, and that's also configurable for Kubernetes. 🤷🏼
I think it would be appropriate to mention in the documentation that setting --concurrency=1 for K8s is recommended not to save resources, but for proper resource management. What do you think? It would be useful for DevOps to be aware of this.
My opinion is that we should allow operators to decide the size of their services/pods. For the LMS/Studio, that is done by defining OPENEDX_CMS_UWSGI_WORKERS, and the same principle applies to Celery workers by adding the OPENEDX_CMS_CELERY_WORKERS variable. Bigger pods could allow some installations to optimize for their case.
Nevertheless, my Kubernetes deployment uses --concurrency=1, with a Horizontal Pod Autoscaling configuration.
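The two scaling approaches discussed in this thread can be compared with a little arithmetic. The following is only an illustrative sketch: the `WorkerLayout` class and its field names are hypothetical and are not part of Tutor, Celery, or Kubernetes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerLayout:
    """Hypothetical description of how Celery capacity is split up."""

    replicas: int      # number of pods/containers (orchestrator level)
    concurrency: int   # worker processes per container (--concurrency)

    @property
    def task_slots(self) -> int:
        # Tasks that can run at the same time across the whole deployment.
        return self.replicas * self.concurrency


# Four single-process pods versus one pod running four processes.
many_small = WorkerLayout(replicas=4, concurrency=1)
one_big = WorkerLayout(replicas=1, concurrency=4)

print(many_small.task_slots, one_big.task_slots)
```

Both layouts provide the same number of task slots. What differs is whether scaling and failure handling happen at the orchestrator level or inside a single container, which is the trade-off being debated here.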
@@ -250,7 +250,7 @@ spec:
 containers:
 - name: lms-worker
 image: {{ DOCKER_IMAGE_OPENEDX }}
-args: ["celery", "--app=lms.celery", "worker", "--loglevel=info", "--hostname=edx.lms.core.default.%%h", "--max-tasks-per-child=100", "--exclude-queues=edx.cms.core.default"]
+args: ["celery", "--app=lms.celery", "worker", "--loglevel=info", "--hostname=edx.lms.core.default.%%h", "--concurrency={{ OPENEDX_LMS_CELERY_WORKERS }}", "--max-tasks-per-child=100", "--exclude-queues=edx.cms.core.default"]
The same problem
@@ -158,7 +158,7 @@ services:
 environment:
 SERVICE_VARIANT: lms
 DJANGO_SETTINGS_MODULE: lms.envs.tutor.production
-command: celery --app=lms.celery worker --loglevel=info --hostname=edx.lms.core.default.%%h --max-tasks-per-child=100 --exclude-queues=edx.cms.core.default
+command: celery --app=lms.celery worker --loglevel=info --hostname=edx.lms.core.default.%%h --concurrency={{ OPENEDX_LMS_CELERY_WORKERS }} --max-tasks-per-child=100 --exclude-queues=edx.cms.core.default
docker-compose also provides mechanisms for managing replicas. Therefore, it is better to use --concurrency=1 here as well.
In theory, yes. I'd be glad to review a PR that does that instead. I just need a way to reduce RAM usage for development. ;)
@@ -177,7 +177,7 @@ services:
 environment:
 SERVICE_VARIANT: cms
 DJANGO_SETTINGS_MODULE: cms.envs.tutor.production
-command: celery --app=cms.celery worker --loglevel=info --hostname=edx.cms.core.default.%%h --max-tasks-per-child 100 --exclude-queues=edx.lms.core.default
+command: celery --app=cms.celery worker --loglevel=info --hostname=edx.cms.core.default.%%h --concurrency={{ OPENEDX_CMS_CELERY_WORKERS }} --max-tasks-per-child 100 --exclude-queues=edx.lms.core.default
docker-compose also provides mechanisms for managing replicas. Therefore, it is better to use --concurrency=1 here as well.
Not sure if it is a Mac thing but I can't see anything against
docs/dev.rst
Outdated
--set OPENEDX_CMS_CELERY_WORKERS=1 \
--set OPENEDX_LMS_CELERY_WORKERS=1 \
--set OPENEDX_CMS_UWSGI_WORKERS=1 \
--set OPENEDX_LMS_UWSGI_WORKERS=1 \
I'd rather avoid asking users to manually set these values. Instead, we should automatically default to workers=1 in development. Can we do that? For instance by overriding the celery config in development?
docs/dev.rst
Outdated
--set OPENEDX_LMS_CELERY_WORKERS=1 \
--set OPENEDX_CMS_UWSGI_WORKERS=1 \
--set OPENEDX_LMS_UWSGI_WORKERS=1 \
--set ELASTICSEARCH_HEAP_SIZE=100m
Same here: can we automatically set this value in development?
@@ -149,6 +149,11 @@ This defines the version that will be pulled from just the Open edX platform git

 By default, there are 2 `uwsgi worker processes <https://uwsgi-docs.readthedocs.io/en/latest/Options.html#processes>`__ to serve requests for the LMS and the CMS. However, each worker requires upwards of 500 Mb of RAM. You should reduce this value to 1 if your computer/server does not have enough memory.

+- ``OPENEDX_LMS_CELERY_WORKERS`` (default: ``"0"``)
+- ``OPENEDX_CMS_CELERY_WORKERS`` (default: ``"0"``)
Adding new configuration settings to Tutor core is a personal trigger of mine 🧨 Do we really want to make changes to the default production values? If yes, can we:
- propose better defaults?
- make these custom changes possible via a patch instead of two new configuration settings?
Yeah, I understand the reluctance to add new config items. It's just that in this case, it was the most straightforward way to achieve what I was after. There's precedent, too: OPENEDX_LMS_UWSGI_WORKERS is there for very similar reasons.
Regarding the defaults, I'm not actually changing them: I'm just making them explicit, where before they were implicit. (The implicit default is to scale the workers to however many CPUs you have, and that's what "0" means.)
As for using patches, I wouldn't mind except for the fact that, as mentioned above, this is just doing what OPENEDX_LMS_UWSGI_WORKERS does, except for Celery workers. If we have that configuration, I don't see why we shouldn't have this one.
All of this said, I really like the idea of changing certain things automatically for development environments, whether they have corresponding config items or not. For instance, after I issued this PR it came to my attention that Tutor's importing * from devstack.py for the development settings, and that means that we aren't using Celery workers at all! (See https://github.com/openedx/edx-platform/blob/master/lms/envs/devstack.py#L35.) So why is tutor dev even firing up workers?
In any case, the latter sounds like it warrants a separate PR. My question regarding this PR, though, is whether we do or do not want OPENEDX_LMS_CELERY_WORKERS. It might not make sense to change this in a Kubernetes setting, but I'm willing to defend that it does on any Docker deployment where you have more CPUs than you have RAM (so to speak).
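The meaning of the "0" default described in the comment above can be sketched as follows. This is a minimal illustration, not Celery's or Tutor's actual code: the function name `effective_concurrency` and its `cpu_count` parameter are hypothetical, and the sketch assumes that "0" simply means "fall back to the number of CPUs".

```python
import os
from typing import Optional


def effective_concurrency(setting: str, cpu_count: Optional[int] = None) -> int:
    """Resolve a *_CELERY_WORKERS-style value to a process count.

    Assumption: "0" means "not set", so fall back to the CPU count.
    """
    requested = int(setting)
    if requested < 0:
        raise ValueError("concurrency cannot be negative")
    if requested > 0:
        return requested
    # Fall back to the number of CPUs; os.cpu_count() may return None.
    detected = cpu_count if cpu_count is not None else os.cpu_count()
    return detected or 1
```

With this reading, a value of "1" pins each worker container to a single process, while "0" keeps the previous implicit behaviour of one process per CPU.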
How about this for not starting workers at all in dev mode? #1041
The question remains whether we still want to let people configure the number of Celery workers manually. (I say we let them.)
I love the fact that we can disable workers in dev. I commented on #1041.
Let's now focus on the possibility to customize the number of celery runners. I agree that this would be a useful feature. If we really have to, we'll introduce new configuration values, but I'd like to see if we can avoid it. For instance, could we avoid that by creating a celery config file? This file would include a {{ patch("edx-platform-celery-config") }} statement. That way, we wouldn't have to create new configuration settings for every celery parameter.
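A Celery config file along the lines proposed above might look roughly like this. It is a sketch under assumptions: the file name `celeryconfig.py` is hypothetical, and how edx-platform would load such a module is not settled in this thread. The setting names `worker_concurrency` and `worker_max_tasks_per_child` are Celery's documented lowercase settings.

```python
# celeryconfig.py (hypothetical file name)
# Assumption: the worker is pointed at this module, for example with
# Celery's app.config_from_object("celeryconfig").

# Number of worker processes; Celery defaults to the CPU count when unset.
worker_concurrency = 1

# Recycle each worker process after 100 tasks, matching the
# --max-tasks-per-child value used in the existing worker command.
worker_max_tasks_per_child = 100

# In the Tutor template, a patch statement such as
# {{ patch("edx-platform-celery-config") }} would be rendered here, letting
# plugins append or override settings without new Tutor configuration values.
```

The appeal of this design is that any Celery setting can be overridden through the patch, rather than adding one Tutor configuration value per parameter.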
@regisb
Agree! There should be an option to disable workers in dev, and it could be disabled by default in dev mode. Personally, I like the idea of having a config file with a patch. By default, the config file should have the minimum config needed to start.
Nevertheless, I feel that won't cover every additional configuration:
- Missing config options for --without-mingle and --without-gossip celery/celery#2566
- https://docs.celeryq.dev/en/stable/reference/cli.html
- https://docs.celeryq.dev/en/stable/userguide/configuration.html: this won't have a way to disable gossip, mingle, and heartbeat, or to configure (vertical) autoscaling.
Please see my answer here: #1126 (comment)
I propose that all remaining comments are made on issue #1126
@arbrandes Hi, there are a few to-be-addressed comments added by Régis. Please take a look when you get a chance. Thanks.
This allows the user to configure how many Celery workers are spawned independently of how many CPUs there are in the system. The default is to spawn as many workers as there are CPUs, which in some cases can consume too many resources. (The setting should be particularly useful to people running Tutor for development on Linux machines, where reducing the concurrency to "1" can reduce RAM usage significantly.)
@arbrandes Hi, what's the plan for this PR? Thanks
I can look into adding a Celery conf file patch, but since I'm not using this PR (as opposed to the one that disables workers in dev mode), it'll probably take a while to get to.
I'm closing this because it seems we're gonna go with a patch/config file solution. The conversation should continue on #1126.
This allows the user to configure how many Celery workers are spawned independently of how many CPUs there are in the system. The default is to spawn as many workers as there are CPUs, which in some cases can consume too many resources.
The setting should be particularly useful to people running Tutor for development on Linux machines, where reducing the concurrency to "1" can reduce RAM usage significantly.
Testing
Before running this branch, launch a Tutor environment and count how many Celery processes there are, with something like:
You should get twice the number of CPUs on the system - one set for each of LMS and CMS - plus two parent processes. (On my machine, which has 12 real cores + 12 virtual ones, the number comes out to 26.)
Stop the environment, install this branch, and set:
Relaunch the environment. The worker containers should be recreated.
Check the number of celery processes. There should now be just 4.
It's worth checking RAM usage, too. Before, my dev environment would take up 8 or more gigabytes of RAM. After, it takes less than 3.
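The process counts quoted in the testing steps follow a simple pattern, sketched below. This assumes each of the two services (LMS and CMS) runs one parent process plus `concurrency` child processes, as described above. The function name `expected_celery_processes` is hypothetical, and the figure of 26 corresponds to a concurrency of 12.

```python
def expected_celery_processes(concurrency: int, services: int = 2) -> int:
    """Estimate the Celery process count: per service, one parent plus its children."""
    return services * (concurrency + 1)


print(expected_celery_processes(12))  # 26, the count reported before the change
print(expected_celery_processes(1))   # 4, the count expected with concurrency set to 1
```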